Database-Text Alignment via Structured Multilabel Classification

Authors

  • Benjamin Snyder
  • Regina Barzilay
Abstract

This paper addresses the task of aligning a database with a corresponding text. The goal is to link individual database entries with sentences that verbalize the same information. By providing explicit semantics-to-text links, these alignments can aid the training of natural language generation and information extraction systems. Beyond these pragmatic benefits, the alignment problem is appealing from a modeling perspective: the mappings between database entries and text sentences exhibit rich structural dependencies, unique to this task. Thus, the key challenge is to make use of as many global dependencies as possible without sacrificing tractability. To this end, we cast text-database alignment as a structured multilabel classification task where each sentence is labeled with a subset of matching database entries. In contrast to existing multilabel classifiers, our approach operates over arbitrary global features of inputs and proposed labels. We compare our model with a baseline classifier that makes locally optimal decisions. Our results show that the proposed model yields a 15% relative reduction in error, and compares favorably with human performance.
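The contrast drawn in the abstract, between a baseline that labels each sentence independently and a model that scores a whole labeling with global features, can be illustrated with a small sketch. Everything below is an assumption made for illustration: the token-overlap score, the 0.5 bias, the "entry reuse" penalty, and the brute-force search are hypothetical stand-ins, not the features or inference procedure of the paper, which the abstract does not specify.

```python
from itertools import chain, combinations, product


def overlap(sentence, entry):
    """Toy local score (hypothetical): shared lowercase tokens minus a 0.5 bias."""
    shared = set(sentence.lower().split()) & set(entry.lower().split())
    return len(shared) - 0.5


def subsets(n):
    """All subsets of the entry indices 0..n-1, as frozensets."""
    idx = range(n)
    combos = chain.from_iterable(combinations(idx, r) for r in range(n + 1))
    return [frozenset(c) for c in combos]


def local_baseline(sentences, entries):
    """Label each sentence independently: keep every entry with a positive local score."""
    return [
        frozenset(j for j, e in enumerate(entries) if overlap(s, e) > 0)
        for s in sentences
    ]


def global_score(sentences, entries, labeling, reuse_penalty=1.0):
    """Sum of local scores plus one illustrative global feature:
    a penalty each time an entry is linked to more than one sentence."""
    local = sum(
        overlap(sentences[i], entries[j])
        for i, labels in enumerate(labeling)
        for j in labels
    )
    counts = [sum(j in labels for labels in labeling) for j in range(len(entries))]
    reuse = sum(max(0, c - 1) for c in counts)
    return local - reuse_penalty * reuse


def global_decode(sentences, entries):
    """Brute-force search over all joint labelings (only feasible for toy inputs)."""
    options = subsets(len(entries))
    best = max(
        product(options, repeat=len(sentences)),
        key=lambda labeling: global_score(sentences, entries, labeling),
    )
    return list(best)


sentences = ["Smith scored a late goal", "A goal came late in the game"]
entries = ["Smith goal", "late game"]

local_result = local_baseline(sentences, entries)
global_result = global_decode(sentences, entries)
print(local_result)
print(global_result)
```

On this toy input the local baseline links both entries to both sentences, while the joint search keeps one entry per sentence because the reuse penalty outweighs the weaker matches. Exhaustive enumeration grows exponentially with the number of sentences and entries, which is why tractability is the central concern for a model with global dependencies.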

Similar resources

Multi-Label Classification of Short Text: A Study on Wikipedia Barnstars

A content analysis of Wikipedia barnstars, personalized tokens of appreciation given to participants, reveals a wide range of valued work extending beyond simple editing to include social support, administrative actions, and types of articulation work. Barnstars are examples of short semi-structured text characterized by informal grammar and language. We propose a method to classify these barnsta...

Flexible Text Segmentation with Structured Multilabel Classification

Many language processing tasks can be reduced to breaking the text into segments with prescribed properties. Such tasks include sentence splitting, tokenization, named-entity extraction, and chunking. We present a new model of text segmentation based on ideas from multilabel classification. Using this model, we can naturally represent segmentation problems involving overlapping and non-contiguo...

A Multiclassifier based Document Categorization System: profiting from the Singular Value Decomposition Dimensionality Reduction Technique

In this paper we present a multiclassifier approach for multilabel document classification problems, where a set of k-NN classifiers is used to predict the category of text documents based on different training subsampling databases. These databases are obtained from the original training database by random subsampling. In order to combine the predictions generated by the multiclassifier, Bayes...

Multilabel Classification through Structured Output Learning - Methods and Applications

Aalto University, P.O. Box 11000, FI-00076 Aalto, www.aalto.fi. Author: Hongyu Su. Name of the doctoral dissertation: Multilabel Classification through Structured Output Learning - Methods and Applications. Publisher: School of Science. Unit: Department of Computer Science. Series: Aalto University publication series DOCTORAL DISSERTATIONS 28/2015. Field of research: Information and Computer Science. Manuscrip...

Diagnosis Code Prediction from Electronic Health Records as Multilabel Text Classification: A Survey

This article presents a survey on diagnosis code prediction from various information in Electronic Health Records (EHR): both unstructured free text and structured data. Particularly, our interests are in casting the problem as text classification with multiple sources and using neural network based models. We will first present previous work in this area and describe some simple baseline model...

Publication date: 2007